Abstract

Background: Most phenotypic effects of drugs involve the interactions between drugs and their target
proteins; however, our knowledge of the molecular mechanisms of drug-target interactions is very limited.
One of the challenging issues in recent pharmaceutical science is to identify the underlying molecular features that
govern drug-target interactions.

Results: In this paper, we make a systematic analysis of the correlation between drug side effects and protein 
domains, which we call "pharmacogenomic features," based on the drug-target interaction network. We detect 
drug side effects and protein domains that appear jointly in known drug-target interactions, which is made 
possible by using classifiers with sparse models. It is shown that the inferred pharmacogenomic features can be 
used for predicting potential drug-target interactions. We also discuss advantages and limitations of the 
pharmacogenomic features, compared with the chemogenomic features that are the associations between drug 
chemical substructures and protein domains. 

Conclusion: The inferred side effect-domain association network is expected to be useful for estimating common 
drug side effects for different protein families and characteristic drug side effects for specific protein domains. 




Background 

Most phenotypic effects of drugs involve the
interactions between drugs and their target proteins
(drug-target interactions hereafter). Drug molecules
often interact not only with a therapeutic target but also
with other proteins (off-targets hereafter), which
could lead to unwanted side effects [1]. Therefore, the
identification of all target proteins of drugs, including
the therapeutic targets and off-targets, is a crucial
process in drug development. In addition,
understanding the molecular mechanism of drug phenotypic
effects in terms of drug-target interactions is
an important issue in many pharmaceutical applications.
It is hypothesized that drug phenotypic effects
involve many kinds of biological features of
drugs and proteins (e.g., drug chemical substructures,
pharmacophores, protein functional sites, and biological
pathways).

Recently, a variety of computational methods have 
been developed for large-scale prediction of drug-target 
interactions in the context of chemogenomics or phar- 
macogenomics. The key idea of the chemogenomic 
approach is that chemically similar compounds are likely 
to interact with similar proteins, and the prediction is 
performed based on compound chemical structures 
and/or protein sequences [2-7]. In contrast, the key idea 
of the pharmacogenomic approach is that phenotypically 
similar drugs are likely to interact with similar proteins, 
and the prediction is performed based on drug side 
effects and/or protein sequences [8-10]. However, the 
predictive models of most previous methods are not 
biologically interpretable, which makes it difficult to 
interpret biological features of drug-target interactions 
or compound-protein interactions. 
The identification of biological features which are
associated with drug-target interactions or compound- 
protein interactions is becoming a challenging issue in 
recent pharmaceutical science. In the context of che- 
mogenomics, some machine learning methods with 
sparse models have been proposed to detect informa- 
tive combinations of drug chemical substructures and 
protein domains which may explain the mechanism of 
drug-target interactions. The inferred features are 
called "chemogenomic features" [11,12]. In addition, 
the use of a data mining method has been proposed to 
infer molecular substructure pairs which appear fre- 
quently and significantly in interacting drug-target 
pairs [13]. The next challenge is to relate drug-target 
interactions with drug phenotypic effects (e.g., pharma- 
ceutical effects and side effects). Recently, the use of 
drug targeted proteins has been proposed for predict- 
ing drug side effects [14-16]. The inference of proteins 
associated with drug side effects has been proposed 
[14,17], but there is no previous work on the analysis 
at the protein domain level. Protein domains are structural,
evolutionary, and functional units, so it would be
important to investigate the associations between pro- 
tein domains and drug side effects on a large scale. 

In this paper, we make a systematic analysis of the 
correlation between drug side effects and protein 
domains, which we call "pharmacogenomic features," 
based on the drug-target interaction network. We detect 
drug side effects and protein domains that appear jointly
in known drug-target interactions, which is made
possible by using classifiers with sparse models. It is 
shown that the inferred pharmacogenomic features can 
be used for predicting potential drug-target interactions. 
We also discuss advantages and limitations of the phar- 
macogenomic features, compared with the chemoge- 
nomic features that are the associations between drug 
chemical substructures and protein domains. To our 
knowledge, no other computational method has been 
reported for relating protein domains with drug side 
effects. The inferred side effect-domain association net- 
work is expected to be useful for estimating common 
drug side effects for different protein families and char- 
acteristic drug side effects for specific protein domains. 

Results and discussion 

Inference of pharmacogenomic features 

We applied four methods, L1LOG, L2LOG, L1LOG-C,
and L2LOG-C, to infer pharmacogenomic features from
the drug-target interaction network. Note that L1LOG
and L2LOG are respectively L1- and L2-regularized
logistic regressions with the tensor product descriptors,
and L1LOG-C and L2LOG-C are respectively L1- and
L2-regularized logistic regressions with the combined
descriptors (see the Methods section for more details).
In each method, we inferred pharmacogenomic features
with positive weights in the predictive model.

Each pharmacogenomic feature is composed of a side
effect and a protein domain that are thought to be
associated with each other. The protein domain in a
pharmacogenomic feature tends to be present in the
protein family targeted by drugs that cause the side
effect in the same feature. We quantified the degree of
association between side effects and protein domains
by evaluating the weights on the corresponding
pharmacogenomic features. Figure 1 shows a small part of the
side effect-domain association network inferred by the
L1LOG method, where edges are placed between side
effects and protein domains in positively weighted
pharmacogenomic features; only the edges with the top 200
weights are shown because of space limitations.

We investigated the number of pharmacogenomic
features inferred by the four methods. Figure 2 shows a
summary of the comparison between L1LOG, L2LOG,
L1LOG-C, and L2LOG-C on the gold standard data.
The numbers of inferred features in L1LOG and
L1LOG-C are far smaller than those in L2LOG and
L2LOG-C, respectively. This observation means that
the sparsity induced by the L1 penalty reduces the
number of features retained from the descriptors. This
feature extraction property helps us to analyze the
inferred features for biological interpretation in practice.

Figure 3 shows the overlap of inferred pharmacogenomic
features between the four methods. L1LOG inferred
a very limited number of features, and most of them
were included in those of L2LOG. This result suggests
that the inferred features of L1LOG are more
representative than those of L2LOG. Both L1LOG and
L1LOG-C are sparsity-inducing methods, but the number
of common features between the two methods was very
limited. This result suggests that the biological
interpretation of the inferred features may depend on the
descriptors designed for drug-target pairs.

Reconstruction of known drug-target interactions 

We examined the validity of the pharmacogenomic 
features inferred by L1LOG, L2LOG, L1LOG-C, and
L2LOG-C in terms of generalization properties for 
drug-target interactions. In order to test the ability of 
each method to reconstruct known drug-target interac- 
tions from the features, we performed two types of 
cross-validation experiments: pair-wise cross-validation 
and block-wise cross-validation (see the Methods sec- 
tion for more details). We also made a comparison 
between the pharmacogenomic features and the
chemogenomic features. Note that the chemogenomic
features correspond to the associations between drug
chemical substructures and protein domains [12]. The
methods with the pharmacogenomic features and the
chemogenomic features are referred to as the
pharmacogenomic approach and the chemogenomic approach,
respectively, below.

Figure 1 Part of the inferred side effect and protein domain network. Green circles and red rectangles represent side effects and protein
domains, respectively. Node size represents node degree. Edge width represents the weight of the side effect and protein domain pair.



Figure 4 AUC scores and the number of features in the cross-validation experiments by L1LOG and L2LOG. The left panels show the
results of the pair-wise cross-validation, while the right panels show the results of the block-wise cross-validation. The horizontal axis of each
panel indicates the chemical structure similarity used for the clustering threshold in constructing the benchmark data.

We evaluated the performance by using the ROC
curve (receiver operating characteristic curve). The ROC
curve is a plot of true positive rates against false
positive rates over many thresholds on the prediction
scores, where true positives are correctly predicted
interactions and false positives are incorrectly predicted
interactions. We computed the total AUC score (the
area under the ROC curve) over the five folds.
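
As a rough illustration of the evaluation measure described above, the following minimal sketch computes the AUC from labels and prediction scores using the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve. The function name `auc_score` and the toy data are hypothetical, and pooling the scores of all five folds into one list before calling it is only one possible reading of "total AUC score"; the paper does not spell out this detail.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (interacting pair) or any other value (non-interacting).
    scores: prediction scores, higher meaning "more likely to interact".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative pair")

    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores pooled over folds).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auc_score(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs are ordered correctly
```

The quadratic pairwise loop keeps the sketch short; for large benchmark sets a sort-based implementation would be preferable.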

Figure 4 shows the resulting AUC scores and the
number of inferred features by L1LOG and L2LOG
based on nine benchmark datasets with different
clustering thresholds (see the Methods section for more details).
L1LOG infers a smaller number of features than L2LOG
in most cases. Interestingly, the prediction accuracy of
L1LOG is maintained to some extent. Another
observation is that the block-wise cross-validation provides
lower AUC scores than the pair-wise cross-validation.
This result suggests that target prediction for new drugs
with no target information and ligand prediction for
orphan proteins with no ligand information are quite
difficult problems.

The pharmacogenomic approach and the chemogenomic
approach showed similar behaviors in the pair-wise
cross-validation setting, while the two approaches
showed different behaviors in the block-wise
cross-validation setting. The performance of the pharmacogenomic
approach was better than that of the chemogenomic
approach for the benchmark data consisting of
structurally different drugs (i.e., in the case of low chemical
similarity thresholds). On the other hand, the
performance of the pharmacogenomic approach was worse than
that of the chemogenomic approach for benchmark data
containing many structurally similar drugs (i.e., in the case
of high chemical similarity thresholds). For example, the
pharmacogenomic approach worked well for the
TestDrug-TrainProtein pairs when the chemical similarity
threshold is 0.2, and for the TestDrug-TestProtein pairs
when the chemical similarity threshold lies in the range
0.2-0.5.

Figure 5 shows the resulting AUC scores and the
number of inferred features for L1LOG-C and L2LOG-C
based on nine benchmark datasets. Note that L1LOG-C
and L2LOG-C are based on the combination of the
tensor product descriptor and individual feature vectors of
drugs and target proteins. Tendencies similar to those of
L1LOG and L2LOG can be observed in L1LOG-C and
L2LOG-C as well. However, the AUC scores of L1LOG-C
and L2LOG-C tend to be higher than those of L1LOG
and L2LOG in both the pharmacogenomic approach and
the chemogenomic approach. This result suggests that
the combination of the tensor product descriptor and
individual feature vectors of drugs and target proteins is
meaningful for predicting drug-target interactions.

Figure 5 AUC scores and the number of features in the cross-validation experiments by L1LOG-C and L2LOG-C. The left panels show
the results of the pair-wise cross-validation, while the right panels show the results of the block-wise cross-validation. The horizontal axis of each
panel indicates the chemical structure similarity used for the clustering threshold in constructing the benchmark data.

Biological interpretation of side effects and protein 
domains in the inferred pharmacogenomic features 

We made biological interpretations for drug side effects
and protein domains in the inferred pharmacogenomic
features. Table 1 shows examples of highly weighted
side effect-domain pairs in the pharmacogenomic
features inferred by the L1LOG method. Table 2 shows
examples of highly weighted protein domains for each
side effect inferred by the L1LOG method. Table 3
shows examples of highly weighted side effects for each
protein domain inferred by the L1LOG method. The
inferred side effect-domain associations suggest potential
side effects involving each protein domain and side
effects for a wide range of protein families. The results
of all inferred pharmacogenomic features in Tables 1, 2,
and 3 can be obtained from Additional files 1, 2, and 3
in supplemental materials. Here we discuss some
examples in Tables 1, 2, and 3.

7TM_GPCR_Srw, 7TM_GPCR_Srbc, and 7TM_GPCR_Srsx
are Serpentine type 7TM GPCR chemoreceptors,
which are members of the seven-transmembrane
G-protein-coupled receptors (7TM GPCRs) that are involved
in many diseases and are also the targets of many modern
medicinal drugs. Srw, Srbc, and Srsx are solo families
among the superfamilies of chemoreceptors. It is
reasonable to find that these three families share some side
effects such as mental depression, nasal congestion, and
priapism; however, it might be meaningful to find that
these families have their own specific side effects:
shortness of breath, weakness, and ptosis for Srw; labyrinthitis,
burning sensation, and torticollis for Srbc; and tardive
dyskinesia, hyperprolactinemia, and Parkinson for Srsx.

Table 1 Examples of highly weighted pharmacogenomic features inferred by the L1LOG method

Rank  Weight    Side effect          Protein domain ID  Protein domain definition
1     330.0000  anaphylaxis          PF13558            SbcCD_C (Putative exonuclease SbcCD, C subunit)
2     330.0000  tardive dyskinesia   PF10320            7TM_GPCR_Srsx (Serpentine type 7TM GPCR chemoreceptor Srsx)
3     330.0000  labyrinthitis        PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)
4     330.0000  shortness of breath  PF02898            NO_synthase (Nitric oxide synthase, oxygenase domain)
5     330.0000  priapism             PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
6     330.0000  nasal congestion     PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
7     330.0000  nasal congestion     PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)
8     330.0000  weakness             PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
9     327.1335  burning sensation    PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)
10    326.7361  glycosuria           PF00191            Annexin (Annexin)
11    325.3147  shortness of breath  PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
12    324.4456  shortness of breath  PF02931            Neur_chan_LBD (Neurotransmitter-gated ion-channel ligand binding domain)
13    320.0000  anaphylaxis          PF00060            Lig_chan (Ligand-gated ion channel)
14    292.0382  hypoventilation      PF02898            NO_synthase (Nitric oxide synthase, oxygenase domain)
15    281.0425  hypoventilation      PF02931            Neur_chan_LBD (Neurotransmitter-gated ion-channel ligand binding domain)

Neur_chan_LBD (Neurotransmitter-gated ion-channel 
ligand binding domain) is a transmembrane receptor- 
ion channel complex that binds specific ligands for 
rapid transmission of signals at chemical synapses, 
which includes nicotinic acetylcholine receptor (AchR), 
glycine receptor, gamma-aminobutyric acid (GABA) 
receptor, serotonin 5HT3 receptor, and glutamate recep- 
tor. By viewing the side effects on the protein domain 
level, this domain was shown to be involved in many 
side effects, such as shortness of breath, respiratory 
arrest, blindness, hypoventilation, increased salivation, 
drug dependence, and proctitis. It is understandable that 
most of these side-effects are shared by NO_synthase 
(Nitric oxide synthase, oxygenase domain). NO_synthase 
has isoenzymes eNOS (endothelial NOS) and nNOS 
(neuronal NOS); the former is the primary signal gen- 
erator in the control of vascular tone, insulin secretion, 
and airway tone, and the latter is involved in the
development of the nervous system.

Novel predictions 

Finally, we conducted a large-scale prediction of
unknown interactions between all drugs and all proteins
based on the pharmacogenomic features inferred by L1LOG.
We learned a predictive model based on all drug-target
pairs in the gold standard data, and applied it to all
drugs and proteins for which side-effect information
and domain information are available. The list of
the top 1000 predictions is provided in Additional file 4.

Conclusion 

In this paper we made a systematic analysis of the
correlation between drug side effects and protein domains,
which we call pharmacogenomic features, using binary
classifiers with sparse models based on the drug-target
interaction network. We showed the usefulness of the
inferred pharmacogenomic features for predicting
drug-target interactions. To our knowledge, this work is the
first study to relate protein domains with drug side
effects on a large scale.

In this study, we used logistic regression as a binary 
classifier, but other classifiers can be used for the same 
objective. For example, the support vector machine (SVM)
is a good candidate for a high-performance binary
classifier. In fact, we performed the same analysis using
SVM in a manner similar to logistic regression, and
confirmed that the same tendency in the results is
obtained. The detailed results can be found in
Additional files 5 and 6.

In this study we used side effect profiles of drugs and 
domain profiles of target proteins in the correlation ana- 
lysis, but the performance and the biological interpreta- 
tion depend heavily on the elements in the profiles of 
drugs and proteins. The method cannot extract features
which are absent from the predefined descriptors, so the 
generalization properties of the method could be 
improved by constructing more meaningful descriptors 
or using more complete descriptors. 

Materials and methods 

Data 

Drug-target interactions 

We obtained the information about drug-target interac- 
tions from the DrugBank database [18]. The number of 
interactions in the dataset is 1064. These interactions 
involve 413 drugs and 400 target proteins. We used this 
data set as gold standard data in the cross-validation 
experiment. 

Table 2 Examples of highly weighted protein domains for each side effect inferred by the L1LOG method

Rank  Side effect          Weight    Protein domain ID  Protein domain definition
1     anaphylaxis          330.0000  PF13558            SbcCD_C (Putative exonuclease SbcCD, C subunit)
                           320.0000  PF00060            Lig_chan (Ligand-gated ion channel)
                           115.8250  PF08491            SE (Squalene epoxidase)
                           113.4802  PF00989            PAS (PAS fold)
                           74.9419   PF05790            C2-set (Immunoglobulin C2-set domain)
2     tardive dyskinesia   330.0000  PF10320            7TM_GPCR_Srsx (Serpentine type 7TM GPCR chemoreceptor Srsx)
3     labyrinthitis        330.0000  PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)
                           70.4820   PF10140            YukC (WXG100 protein secretion system (Wss), protein YukC)
4     shortness of breath  330.0000  PF02898            NO_synthase (Nitric oxide synthase, oxygenase domain)
                           325.3147  PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
                           324.4456  PF02931            Neur_chan_LBD (Neurotransmitter-gated ion-channel ligand binding domain)
                           100.7863  PF10320            7TM_GPCR_Srsx (Serpentine type 7TM GPCR chemoreceptor Srsx)
5     priapism             330.0000  PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
                           112.9922  PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)
                           2.0630    PF00206            Lyase_1 (Lyase)
6     nasal congestion     330.0000  PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
                           330.0000  PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)
7     weakness             330.0000  PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
8     burning sensation    327.1335  PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)
                           14.2660   PF02867            Ribonuc_red_lgC (Ribonucleotide reductase, barrel domain)
                           7.4168    PF03522            KCl_Cotrans_1 (K-Cl Co-transporter type 1 (KCC1))
                           5.7520    PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)
                           0.3741    PF00209            SNF (Sodium:neurotransmitter symporter family)
9     glycosuria           326.7361  PF00191            Annexin (Annexin)
                           145.3436  PF08377            MAP2_projctn (MAP2/Tau projection domain)
                           49.9279   PF03491            5HT_transporter (Serotonin (5-HT) neurotransmitter transporter, N-terminus)
                           -18.0767  PF02222            ATP-grasp (ATP-grasp domain)
                           32.5342   PF00698            Acyl_transf_1 (Acyl transferase domain)
10    hypoventilation      292.0382  PF02898            NO_synthase (Nitric oxide synthase, oxygenase domain)
                           281.0425  PF02931            Neur_chan_LBD (Neurotransmitter-gated ion-channel ligand binding domain)

Pharmacological and chemical data of drugs 

We obtained the information about side effects of drugs 
from the SIDER database that accumulates reported side 
effects from package inserts for marketed drugs [19]. 
We represented each drug by a 1179-dimensional binary 
vector in which the presence or absence of each side 
effect is coded as 1 or 0. 

We obtained the information about chemical
structures of drugs from the PubChem database [20]. We
represented each drug by an 881-dimensional binary
vector in which 881 PubChem substructures are used
and the presence or absence of each substructure is
coded as 1 or 0.

Genomic and functional data of target proteins 

We obtained genomic information about target proteins
from the UniProt database [21], and obtained the protein
domains from the PFAM database [22]. We represented
each target protein by a 476-dimensional binary vector in
which 476 PFAM domains are used and the presence or
absence of each domain is coded as 1 or 0.

Table 3 Examples of highly weighted side effects for each protein domain inferred by the L1LOG method

Rank  Protein domain ID  Protein domain definition                                                   Weight    Side effect
1     PF13558            SbcCD_C (Putative exonuclease SbcCD, C subunit)                             330.0000  anaphylaxis
                                                                                                     41.1178   allergic reaction
2     PF10320            7TM_GPCR_Srsx (Serpentine type 7TM GPCR chemoreceptor Srsx)                 330.0000  tardive dyskinesia
                                                                                                     216.3340  mental depression
                                                                                                     102.3075  regurgitation
                                                                                                     100.7863  shortness of breath
                                                                                                     90.8773   hyperprolactinemia
3     PF10316            7TM_GPCR_Srbc (Serpentine type 7TM GPCR chemoreceptor Srbc)                 330.0000  labyrinthitis
                                                                                                     330.0000  nasal congestion
                                                                                                     327.1335  burning sensation
                                                                                                     266.0200  torticollis
                                                                                                     225.7442  testicular swelling
4     PF02898            NO_synthase (Nitric oxide synthase, oxygenase domain)                       330.0000  shortness of breath
                                                                                                     292.0382  hypoventilation
                                                                                                     232.9827  respiratory arrest
                                                                                                     229.9422  increased salivation
                                                                                                     204.4183  blindness
5     PF10324            7TM_GPCR_Srw (Serpentine type 7TM GPCR chemoreceptor Srw)                   330.0000  priapism
                                                                                                     330.0000  nasal congestion
                                                                                                     330.0000  weakness
                                                                                                     325.3147  shortness of breath
                                                                                                     279.9735  ptosis
6     PF00191            Annexin (Annexin)                                                           326.7361  glycosuria
                                                                                                     117.1047  vasculitis
                                                                                                     95.9706   sialadenitis
                                                                                                     79.3155   COPD
                                                                                                     71.9669   choking
7     PF02931            Neur_chan_LBD (Neurotransmitter-gated ion-channel ligand binding domain)    324.4456  shortness of breath
                                                                                                     281.0425  hypoventilation
                                                                                                     244.4901  increased salivation
                                                                                                     228.2349  respiratory arrest
                                                                                                     221.8014  drug dependence
8     PF00060            Lig_chan (Ligand-gated ion channel)                                         320.0000  anaphylaxis
                                                                                                     19.4911   allergic reaction
9     PF08377            MAP2_projctn (MAP2/Tau projection domain)                                   251.7744  hyperuricemia
                                                                                                     145.3436  glycosuria
                                                                                                     47.3304   sialadenitis
                                                                                                     44.6506   choking
                                                                                                     36.2648   polydipsia
10    PF14396            CFTR_R (Cystic fibrosis TM conductance regulator (CFTR), regulator domain)  239.2866  gallbladder disease
                                                                                                     212.0452  gynecomastia
                                                                                                     156.3952  paronychia
                                                                                                     134.2967  prostatism
                                                                                                     131.3925  cervical erosion

Classifiers for drug-target pairs 

We consider the feature extraction problem in the
context of drug-target interaction prediction. We represent a
pair of drug $D$ and protein $P$ by $(D, P)$. Suppose that we
are given a learning set of drug-target pairs $(D_i, P_j)$
($i = 1, 2, \ldots, n_D$; $j = 1, 2, \ldots, n_P$), where the pairs are known
to interact or not, $n_D$ is the number of drugs, and $n_P$ is
the number of target proteins in the learning set.

We represent a pair of drug $D$ and protein $P$ by a
feature vector $\Phi(D, P)$, and then estimate a function
$f(D, P) = w^T \Phi(D, P)$ that predicts whether the
drug-target pair $(D, P)$ is an interacting pair or not. We optimize
the weight vector $w$ based on the learning set with label
information.

The feature vector of drug $D$ is supposed to be
represented as an $M$-dimensional binary vector:

$$\Phi(D) = (d_1, d_2, \ldots, d_M)^T,$$

where $d_k \in \{0, 1\}$, $k = 1, \ldots, M$. For example, $\Phi(D)$
is a profile of side effects or chemical substructures in
this study. In the same manner, the feature vector of
protein $P$ is supposed to be represented as an
$N$-dimensional binary vector:

$$\Phi(P) = (p_1, p_2, \ldots, p_N)^T,$$

where $p_l \in \{0, 1\}$, $l = 1, \ldots, N$. For example, $\Phi(P)$ is
a profile of protein domains in this study.

We propose two kinds of feature vectors for each
drug-target pair. First, we represent each drug-target
pair by the tensor product between $\Phi(D)$ and $\Phi(P)$ as
follows:

$$\Phi(D, P) = \Phi(D) \otimes \Phi(P) = (d_1 p_1, \ldots, d_1 p_N, \ldots, d_M p_1, \ldots, d_M p_N)^T,$$

where $\Phi(D, P)$ is an $(M \times N)$-dimensional feature
vector. We refer to the feature vector as "tensor product
descriptor" in this study. This tensor product descriptor
is similar to that in the previous work [12].

Second, we represent each drug-target pair by
the combination of the tensor product descriptor
$\Phi(D) \otimes \Phi(P)$ and individual feature vectors $\Phi(D)$ and
$\Phi(P)$ as follows:

$$\Phi_C(D, P) = \left( (\Phi(D) \otimes \Phi(P))^T, \Phi(D)^T, \Phi(P)^T \right)^T = (d_1 p_1, \ldots, d_1 p_N, \ldots, d_M p_1, \ldots, d_M p_N, d_1, d_2, \ldots, d_M, p_1, p_2, \ldots, p_N)^T,$$

where $\Phi_C(D, P)$ is an $(M \times N + M + N)$-dimensional
binary vector. We refer to the feature vector as
"combined descriptor" in this study.
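
To make the two descriptors concrete, here is a minimal sketch that builds them from binary drug and protein profiles stored as plain Python lists. The function names and the toy profiles are hypothetical and chosen only for illustration; in the study the drug profile would have 1179 side effect entries (or 881 substructure entries) and the protein profile 476 domain entries.

```python
def tensor_product_descriptor(drug_profile, protein_profile):
    """Return (d_1 p_1, ..., d_1 p_N, ..., d_M p_1, ..., d_M p_N) as a list."""
    return [d * p for d in drug_profile for p in protein_profile]


def combined_descriptor(drug_profile, protein_profile):
    """Tensor product descriptor followed by the drug and protein profiles."""
    return (tensor_product_descriptor(drug_profile, protein_profile)
            + list(drug_profile)
            + list(protein_profile))


def feature_pair(index, n_protein_features):
    """Map a tensor-product index back to (drug feature k, protein feature l), both 0-based."""
    return divmod(index, n_protein_features)


# Toy profiles: M = 3 side effects, N = 2 protein domains.
drug = [1, 0, 1]
protein = [0, 1]

tensor = tensor_product_descriptor(drug, protein)   # [0, 1, 0, 0, 0, 1]
combined = combined_descriptor(drug, protein)       # length 3*2 + 3 + 2 = 11
last_pair = feature_pair(5, len(protein))           # (2, 1): third side effect, second domain
```

An entry of the tensor product descriptor is 1 only when the drug has side effect k and the protein has domain l, which is why a positive weight on that entry can be read as a side effect-domain association.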

In this study we use logistic regression as a binary
classifier to predict whether a drug D interacts with a
target protein P or not. The predictive model is usually
learned by minimizing the loss function with
L2-regularization. However, L2-regularization tends to keep most
weight elements non-zero, which makes it difficult
to interpret features from the resulting weight vector.
An alternative is to use L1-regularization,
which tends to make most weight elements zero and
thus makes it easier to interpret features from the
resulting weight vector. Therefore, we introduce a
logistic regression model with L1-regularization.

Suppose that we have a learning set of drug-target
pairs and interaction labels
$(\Phi(D_i, P_j), y_{ij})$, $y_{ij} \in \{+1, -1\}$ ($i = 1, 2, \ldots, n_D$; $j = 1, 2,
\ldots, n_P$), where $n_D$ is the number of drugs and $n_P$ is the
number of target proteins in the learning set. The
weight vector $w$ of the linear logistic regression is
usually learned with L2-regularization as follows:

$$\min_w \|w\|_2^2 + C \sum_{i=1}^{n_D} \sum_{j=1}^{n_P} \log\left(1 + \exp\left(-y_{ij} w^T \Phi(D_i, P_j)\right)\right),$$

where $\|\cdot\|_2^2$ is the squared L2 norm (the sum of squared values)
and $C$ is a regularization parameter to control the
penalty.

To induce sparsity in the model, the weight vector $w$
of the linear logistic regression is learned with
L1-regularization as follows:

$$\min_w \|w\|_1 + C \sum_{i=1}^{n_D} \sum_{j=1}^{n_P} \log\left(1 + \exp\left(-y_{ij} w^T \Phi(D_i, P_j)\right)\right),$$

where $\|\cdot\|_1$ is the L1 norm (the sum of absolute values)
and $C$ is a regularization parameter to control the
sparsity. We examine various values (0.0001, 0.001, 0.01, 0.1,
1, 10, 100, 1000, 10000) for the hyperparameter $C$, and
select the value that gives the highest AUC score in the
cross-validation experiment.
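
The sparsity effect of the L1 penalty can be illustrated with a small sketch that minimizes the L1-regularized objective above by proximal gradient descent (soft-thresholding after each gradient step). This is not the solver used in the study, which is not named in this section; the function names, the fixed step size, the iteration count, and the four toy descriptor vectors are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink toward zero, clip small values to exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, C, step=0.1, n_iter=2000):
    """Minimize ||w||_1 + C * sum_i log(1 + exp(-y_i w.x_i)) with labels y_i in {+1, -1}.

    The step size is an assumption; it must be small enough for the data at hand.
    """
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x_i, y_i in zip(X, y):
            margin = y_i * sum(w_k * x_k for w_k, x_k in zip(w, x_i))
            coeff = -y_i * sigmoid(-margin)  # derivative of the logistic loss w.r.t. w.x_i
            for k, x_k in enumerate(x_i):
                grad[k] += coeff * x_k
        w = [soft_threshold(w_k - step * C * g_k, step) for w_k, g_k in zip(w, grad)]
    return w


# Toy data: four drug-target pairs described by two binary descriptor entries.
X_toy = [[1, 0], [1, 0], [0, 1], [0, 0]]
y_toy = [+1, +1, -1, -1]

w_strong_penalty = fit_l1_logistic(X_toy, y_toy, C=0.5)  # small C: every weight is driven to zero
w_moderate = fit_l1_logistic(X_toy, y_toy, C=1.5)        # only the first entry keeps a positive weight
selected = [k for k, w_k in enumerate(w_moderate) if w_k > 0]
```

In this toy case the first descriptor entry, which appears only in interacting pairs, is the single feature retained with a positive weight, mirroring how positively weighted entries are read as inferred features.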

In practice, we consider applying the logistic
regression with the tensor product descriptor $\Phi(D, P)$ and
with the combined descriptor $\Phi_C(D, P)$. We refer to
L1-regularized logistic regression with the tensor product
descriptor as "L1LOG," and L2-regularized logistic
regression with the tensor product descriptor as
"L2LOG." We refer to L1-regularized logistic
regression with the combined descriptor as "L1LOG-C,"
and L2-regularized logistic regression with the
combined descriptor as "L2LOG-C."

Cross-validation experiments for benchmark data

There are two scenarios for drug-target interaction
prediction from practical viewpoints. The first scenario is
that we have drugs with target information and proteins
with ligand information, and the goal is to additionally
detect missing interactions between the drugs and the
proteins. The second scenario is that we have drugs
with no target information and proteins with no ligand
information, and the goal is to find all potential target
proteins of the drugs and all potential ligands of the target
proteins. To simulate the above two scenarios in the
5-fold cross-validation experiment, we consider two
different settings: pair-wise cross-validation and block-wise
cross-validation.

The pair-wise cross-validation consists of the following 
procedure. First, we split all drug-target pairs in the gold 
standard set into five subsets, treating each pair 
independently. Second, we take each subset of drug-target 
pairs in turn as a test set and the other four subsets as a 
training set. Third, we optimize a predictive model on the 
drug-target pairs in the training set. Finally, we apply the 
predictive model to the drug-target pairs in the test set. 
Note that drug-target pairs are treated as independent of 
each other, so the drugs and target proteins in test pairs 
overlap to some extent with those in the training set. 
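The pair-wise procedure above can be sketched as follows. This is a minimal illustration under our own assumptions (random shuffling with a fixed seed, hypothetical function name), not the code used in the experiments.

```python
import random


def pairwise_folds(pairs, n_folds=5, seed=0):
    """Yield (train_pairs, test_pairs) for each fold of a pair-wise split.

    Pairs are shuffled without regard to which drug or protein they contain,
    so the same drug or protein can occur in both training and test pairs.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled pairs into n_folds subsets of nearly equal size.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [pair for j, fold in enumerate(folds) if j != k for pair in fold]
        yield train, test
```

Each drug-target pair appears in exactly one test set, and a model would be trained on the training pairs of a fold before scoring that fold's test pairs.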

The block-wise cross-validation consists of the following 
procedure. First, we split the drugs and the target proteins 
in the gold standard set into five subsets of drugs and five 
subsets of target proteins. Second, we take each subset of 
drugs (resp. proteins) in turn as test drugs (resp. test 
proteins), and use the other four subsets of drugs (resp. 
proteins) as training drugs (resp. training proteins). Third, 
we optimize a predictive model on drug-target pairs 
consisting of training drugs and training proteins. Finally, 
we compute the prediction scores for three types of 
drug-target pairs: test drugs vs. training target proteins 
(referred to as "TestDrug-TrainProtein"), training drugs vs. 
test target proteins (referred to as "TrainDrug-TestProtein"), 
and test drugs vs. test target proteins (referred to as 
"TestDrug-TestProtein"). Note that every test pair contains a 
drug, a protein, or both that do not appear in the training 
set. Thus, the prediction problem in the block-wise setting 
is more difficult than that in the pair-wise setting. 
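A minimal sketch of the block-wise split is given below. Pairing the k-th drug subset with the k-th protein subset in each fold is our assumption, as are the function names; the text does not specify these details.

```python
import random


def split_into_folds(items, n_folds, rng):
    """Shuffle items and deal them into n_folds subsets of nearly equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def blockwise_folds(drugs, proteins, n_folds=5, seed=0):
    """Yield, for each fold, the training pairs and the three kinds of test pairs."""
    rng = random.Random(seed)
    drug_folds = split_into_folds(drugs, n_folds, rng)
    protein_folds = split_into_folds(proteins, n_folds, rng)
    for k in range(n_folds):
        test_drugs = drug_folds[k]
        test_proteins = protein_folds[k]
        train_drugs = [d for j, f in enumerate(drug_folds) if j != k for d in f]
        train_proteins = [p for j, f in enumerate(protein_folds) if j != k for p in f]
        yield {
            # The model is fitted only on pairs of training drugs and training proteins.
            "train": [(d, p) for d in train_drugs for p in train_proteins],
            "TestDrug-TrainProtein": [(d, p) for d in test_drugs for p in train_proteins],
            "TrainDrug-TestProtein": [(d, p) for d in train_drugs for p in test_proteins],
            "TestDrug-TestProtein": [(d, p) for d in test_drugs for p in test_proteins],
        }
```

The four groups in each fold are disjoint and together cover all drug-protein pairs, so no test drug or test protein contributes to the training pairs of that fold.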

The gold standard data contain many drugs that are chemically 
and structurally almost identical because they were derived 
from the same lead compound. If such near-identical drugs 
were divided between a training set and a test set, the 
prediction in the cross-validation experiment would be 
trivial. To avoid overestimating the prediction accuracy, we 
group similar drugs based on their chemical structures and 
use only drugs that are chemically and structurally different 
to some extent, following a previous work [10]. First, we 
carry out a clustering of all drugs based on the Tanimoto 
coefficient (Jaccard coefficient) [23] of their chemical 
fingerprints using the average linkage algorithm. Second, we 
assign drugs with high Tanimoto coefficients to the same 
cluster and select one representative drug from each cluster. 
Third, we construct a set of representative drugs, which have 
low Tanimoto coefficients with one another. Finally, we 
prepare nine sets of benchmark data consisting of 
representative drugs by varying the clustering threshold on 
the dendrogram from 0.2 to 1.0 in steps of 0.1. When the 
clustering threshold is 0.1, the number of drug clusters is 
only 3 in our data, so the threshold of 0.1 cannot be tested 
in the 5-fold cross-validation. 
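The grouping of similar drugs can be sketched as follows. This is a minimal illustration, assuming that fingerprints are given as sets of on-bit indices, that the threshold is applied to the average Tanimoto similarity between clusters, and that the first member of each cluster serves as its representative. None of these details are specified above, and all names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def average_similarity(cluster_a, cluster_b, fingerprints):
    """Mean Tanimoto coefficient over all cross-cluster drug pairs (average linkage)."""
    total = sum(
        tanimoto(fingerprints[i], fingerprints[j])
        for i in cluster_a
        for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def cluster_drugs(fingerprints, threshold):
    """Agglomerative clustering: repeatedly merge the most similar pair of
    clusters until no pair has an average similarity of at least threshold."""
    clusters = [[name] for name in sorted(fingerprints)]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_similarity(clusters[x], clusters[y], fingerprints)
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def representatives(clusters):
    """Pick one drug per cluster (here simply the first member; an assumption)."""
    return [cluster[0] for cluster in clusters]
```

Under these assumptions, a lower threshold merges more drugs and leaves fewer clusters, which is consistent with the observation that a threshold of 0.1 leaves too few clusters for 5-fold cross-validation.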

Additional material 

Additional file 1: Extracted side effects and protein domains of 
L1LOG. 

Additional file 2: Extracted protein domains for each side effect of 
L1LOG. 

Additional file 3: Extracted side effects for each protein domain of 
L1LOG. 

Additional file 4: The list of novel drug-target predictions of L1LOG. 

Additional file 5: AUC scores and the number of features in the 
pair-wise and block-wise cross-validation experiments by L1LOG, 
L2LOG, L1SVM, and L2SVM. 

Additional file 6: AUC scores and the number of features in the 
pair-wise and block-wise cross-validation experiments by L1LOG-C, 
L2LOG-C, L1SVM-C, and L2SVM-C. 

Competing interests 

The authors declare that they have no competing interests. 

Authors' contributions 

HI tested the performance of the methods and drafted the manuscript. SM 
prepared the datasets and made biological interpretations of the results. YT 
implemented the algorithm of the methods. MK made biological 
interpretations of the results and drafted the manuscript. SG drafted the 
manuscript. YY directed the work and drafted the manuscript. All authors 
read and approved the final manuscript. 






Acknowledgements 

This work was supported by MEXT/JSPS KAKENHI Grant Numbers 25108714, 
24700140, and 25700029. This work was also supported by the Program to 
Disseminate Tenure Tracking System, MEXT, Japan, and Kyushu University 
Interdisciplinary Programs in Education and Projects in Research 
Development. 

Declarations 

The publication cost for this work was supported by JSPS Kakenhi 25700029. 
This article has been published as part of BMC Systems Biology Volume 7 
Supplement 6, 2013: Selected articles from the 24th International Conference 
on Genome Informatics (GIW2013). The full contents of the supplement are 
available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S6. 

Authors' details 

1Division of System Cohort Multi-scale Research Center for Medical Science, 
Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, 
Higashi-ku, Fukuoka, Fukuoka 812-8582, Japan. 2Bioinformatics Center, 
Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 
611-0011, Japan. 3PRESTO, Japan Science and Technology Agency, Kawaguchi, 
Saitama 332-0012, Japan. 4Institute for Advanced Study, Kyushu University, 
6-10-1, Hakozaki, Higashi-ku, Fukuoka, Fukuoka 812-8581, Japan. 

Published: 13 December 2013 



References 

1. Whitebread S, Hamon J, Bojanic D, Urban L: Keynote review: In vitro safety 
pharmacology profiling: an essential tool for successful drug 
development. Drug discovery today 2005, 10(21):1421-1433. 

2. Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M: Prediction of 
drug-target interaction networks from the integration of chemical and 
genomic spaces. Bioinformatics 2008, 24(13):i232-i240. 

3. Faulon J, Misra M, Martin S, Sale K, Sapra R: Genome scale enzyme- 
metabolite and drug-target interaction predictions using the signature 
molecular descriptor. Bioinformatics 2008, 24:225-233. 

4. Jacob L, Vert J: Protein-ligand interaction prediction: an improved 
chemogenomics approach. Bioinformatics 2008, 24:2149-2156. 

5. Bleakley K, Yamanishi Y: Supervised prediction of drug-target interactions 
using bipartite local models. Bioinformatics 2009, 25(18):2397-2403. 

6. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, 
Kuijer MB, Matos RC, Tran TB, et al: Predicting new molecular targets for 
known drugs. Nature 2009, 462(7270):175-181. 

7. van Laarhoven T, Nabuurs SB, Marchiori E: Gaussian interaction profile 
kernels for predicting drug-target interaction. Bioinformatics 2011, 
27(21):3036-3043. 

8. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P: Drug target 
identification using side-effect similarity. Science 2008, 321(5886):263-266. 

9. Yamanishi Y, Kotera M, Kanehisa M, Goto S: Drug-target interaction 
prediction from chemical, genomic and pharmacological data in an 
integrated framework. Bioinformatics 2010, 26(12):i246-i254. 

10. Takarabe M, Kotera M, Nishimura Y, Goto S, Yamanishi Y: Drug target 
prediction using adverse event report systems: a pharmacogenomic 
approach. Bioinformatics 2012, 28(18):i611-i618. 

11. Yamanishi Y, Pauwels E, Saigo H, Stoven V: Extracting Sets of Chemical 
Substructures and Protein Domains Governing Drug-Target Interactions. 
Journal of Chemical Information and Modeling 2011, 51:1183-1194. 

12. Tabei Y, Pauwels E, Stoven V, Takemoto K, Yamanishi Y: Identification of 
chemogenomic features from drug-target interaction networks using 
interpretable classifiers. Bioinformatics 2012, 28(18):i487-i494. 

13. Takigawa I, Tsuda K, Mamitsuka H: Mining Significant Substructure Pairs 
for Interpreting Polypharmacology in Drug-Target Network. PloS one 
2011, 6:e16999. 

14. Mizutani S, Pauwels E, Stoven V, Goto S, Yamanishi Y: Relating drug- 
protein interaction network with drug side effects. Bioinformatics 2012, 
28(18):i522-i528. 

15. Yamanishi Y, Pauwels E, Kotera M: Drug Side-Effect Prediction Based on 
the Integration of Chemical and Biological Spaces. Journal of chemical 
information and modeling 2012, 52(12):3284-3292. 

16. Lounkine E, Keiser MJ, Whitebread S, Mikhailov D, Hamon J, Jenkins JL, 
Lavan P, Weber E, Doak AK, Cote S, et al: Large-scale prediction and 
testing of drug activity on side-effect targets. Nature 2012, 
486(7403):361-367. 

17. Kuhn M, Al Banchaabouchi M, Campillos M, Jensen LJ, Gross C, Gavin AC, 
Bork P: Systematic identification of proteins that elicit drug side effects. 
Molecular systems biology 2013, 9. 

18. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, 
Neveu V, et al: DrugBank 3.0: a comprehensive resource for 'omics' 
research on drugs. Nucleic acids research 2011, 39(suppl 1):D1035-D1041. 

19. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P: A side effect resource to 
capture phenotypic effects of drugs. Molecular systems biology 2010, 6. 

20. Chen B, Wild D, Guha R: PubChem as a source of polypharmacology. J 
Chem Inf Model 2009, 49:2044-2055. 

21. Consortium TU: The Universal Protein Resource (UniProt) in 2010. Nucleic 
Acids Res 2010, 38:D142-D148. 

22. Finn R, Tate J, Mistry J, Coggill P, Sammut J, Hotz H, Ceric G, Forslund K, 
Eddy S, Sonnhammer E, Bateman A: The Pfam protein families database. 
Nucleic Acids Res 2008, 36:D281-D288. 

23. Tanimoto TT: An elementary mathematical theory of classification and 
prediction. 1958, International Business Machines Corporation. 

doi:10.1186/1752-0509-7-S6-S18 

Cite this article as: Iwata et al.: Inferring protein domains associated with 
drug side effects based on drug-target interaction network. BMC Systems 
Biology 2013 7(Suppl 6):S18. 






